
As new and returning residents of New York, we are intrigued by the bustling rat population around us and across each borough. We are interested in exploring the impact of factors like day of the week, month, borough, latitude, longitude, and type of building has on rat sighting around New York. By understanding what influences the amount and location of rat sightings, we will be able to know which areas of the city to avoid and which areas may have the world’s best ratatouille.
Brown rats are not indigenous to New York; they are a product of the colonization of North America and were brought over on ships in the 18th century. The brown rat is the most common variety in New York today and slowly overtook the black rat population due to their highly aggressive and dominant nature. Brown rats’ ability to tunnel through and eat just about anything has lead them to be one of the top pest in New York for millennia. A testament to their cultural relevance, the Rolling Stones 1978 record, Shattered, made a reference to the rats of New York City: “We’ve got rats on the west side”. More recently, these omnivorous creatures have even spurred the creation of a city government position titled Director of Rodent Mitigation or, colloquially, “Rat Czar”.
We set out to answer the following questions:
Throughout the course of working on the project, we became interested in sightings year-over-year and included some time series plots in the analysis.
Our is publicly available from Open Data NYC (https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe), downloaded in November, 2023. The raw data contains 232,090 records of rat sightings and variables relating to geographical location, type of location, and time of sighting. In order to begin the data cleaning and analysis process, we loaded the following libraries:
tidyverselubridatereadrxtsRColorBrewerggthemesgridExtraleaflethighcharterscaleslibrary(tidyverse)
library(lubridate)
library(readr)
library(xts)
library("RColorBrewer")
library("ggthemes")
library("gridExtra")
library("leaflet")
library(leaflet.extras)
library("highcharter")
library(scales)
We begin by importing the rat sightings data using the
read_csv function, clean up the variable names with the
clean_names function, and create some more useful date
variables in a mutate pipeline.
rats_raw <- read_csv("./Rat_Sightings.csv", na = c("", "NA", "N/A", "Unspecified")) %>%
janitor::clean_names() %>%
mutate(created_date = mdy_hms(created_date)) %>%
mutate(sighting_year = year(created_date),
sighting_month_num = month(created_date),
sighting_month = month(created_date, label = TRUE, abbr = FALSE),
sighting_day = day(created_date),
sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE))
There are 232,090 records of rat sightings, ranging from 2010 to 2023 and across all 5 boroughs.
Important variables to our analysis include:
created_date: Date of rat sighting recordsighting_year: Year of sightingsighting_month: Month of sightingsighting_day: Sighting day of the monthsighting_weekday: Sighting day of the weeklocation_type: Rat sighting location type (Government
Building, 3+ Family Apt. Building, Construction site, etc.)city: City of sightingborough: Borough of sightinglatitude: Latitude of sightinglongitude: Longitude of sightingWe first explored how rat sightings vary over time (month, day of the week, year) and how rat sightings vary by borough. To do so, we used simple tables, bar charts, line plots, heat maps, and interactive maps. First, we visualize rat sightings by borough, different measures of time (year, month, and day of the week), and the different types of locations they are found at.
by_borough <- rats_raw %>%
filter(!is.na(borough)) %>%
group_by(borough) %>%
count() %>%
ggplot(aes(x = borough, y = n, fill = n)) +
geom_histogram(stat = "identity", position = "dodge") +
theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 12)) +
xlab("Borough") +
ylab("Count") +
geom_text(aes(label = n), vjust = -0.1, size = 3.75) +
ggtitle('Count of Rat Sightings by Borough') +
scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral')))
by_borough

rat_borough <-
rats_raw %>%
group_by(borough, sighting_month, sighting_year) %>%
mutate(rat_per_month = n()) %>%
slice(1) %>%
select(borough, sighting_month, sighting_year, rat_per_month)
test_borough <- broom::tidy(oneway.test(rat_per_month ~ borough, data = rat_borough))
We see that Brooklyn has the highest amount of rats, followed by Manhattan, the Bronx, Queens, and Staten Island. Running a one-way ANOVA test on the number of monthly rat sightings by borough, we see that the p-value is 3.7065548^{-147} and at least one of the borough means is not equal to the others.
by_year <- rats_raw %>%
group_by(sighting_year) %>%
count() %>%
ggplot(aes(x = sighting_year, y = n, fill = n)) +
geom_histogram(stat = "identity", position = "dodge") +
theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 12)) +
xlab("Year") +
ylab("Count") +
geom_text(aes(label = n), vjust = -0.1, size = 3.75) +
ggtitle('Count of Rat Sightings through the Years') +
scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral')))
by_year

rat_year <-
rats_raw %>%
group_by(sighting_month, sighting_year) %>%
mutate(rat_per_month = n()) %>%
slice(1) %>%
select(sighting_month, sighting_year, rat_per_month)
test_year <- broom::tidy(oneway.test(rat_per_month ~ sighting_year, data = rat_year))
We see a substantial increase in the number of rat sightings after 2020. This increase is consistent with the city of New York’s rat media coverage and the impact of the COVID-19 pandemic. With more restaurants closed and more restaurants offering outdoor dining, rats are more likely to scavenge outside. A warmer, wetter than usual summer in 2021 also contributed to favorable rat conditions. Running a one-way ANOVA test on the number of monthly rat sightings by year, we see that the p-value is 3.9316995^{-13} so at least one of the borough means is not equal to the others.
by_month <- rats_raw %>%
group_by(sighting_month) %>%
count() %>%
ggplot(aes(x = sighting_month, y = n, fill = n)) +
geom_histogram(stat = "identity", position = "dodge") +
theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 9)) +
xlab("Month") +
ylab("Count") +
geom_text(aes(label = n), vjust = -0.1, size = 3.75) +
ggtitle('Count of Rat Sightings by Month') +
scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral')))
by_month

rat_monthly <-
rats_raw %>%
group_by(sighting_month, sighting_year) %>%
mutate(rat_per_month = n()) %>%
slice(1) %>%
select(sighting_month, sighting_year, rat_per_month)
test_month <- broom::tidy(oneway.test(rat_per_month ~ sighting_month, data = rat_monthly))
The most rat sightings are in the summer months with a peak in July. Sightings taper off in the fall, reaching a low in December, and then start to increase in the spring. Warmer weather is more favorable to rat survival and helps their populations grow. Running a one-way ANOVA test on the number of rat sightings by month, we see that the p-value is 7.6361763^{-8} so at least one of the month means is not equal to the others.
by_day <- rats_raw %>%
group_by(sighting_weekday) %>%
count() %>%
ggplot(aes(x = sighting_weekday, y = n, fill = n)) +
geom_histogram(stat = "identity", position = "dodge") +
theme(legend.position ='none',axis.title = element_text(),axis.text.x = element_text(size = 12)) +
xlab("Weekday") +
ylab("Count") +
geom_text(aes(label = n), vjust = -0.1, size = 4) +
ggtitle('Count of Rat Sightings by Day of Week') +
scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral')))
by_day

rat_dow <-
rats_raw %>%
group_by(sighting_weekday, sighting_month, sighting_year) %>%
mutate(rat_per_month = n()) %>%
slice(1) %>%
select(sighting_weekday, sighting_month, sighting_year, rat_per_month)
test_dow <- broom::tidy(oneway.test(rat_per_month ~ sighting_weekday, data = rat_dow))
Weekdays have the most rat sightings, peaking on Mondays and staying relatively high throughout the week, while weekends have much lower counts. Running a one-way ANOVA test on the number of rat sightings by day of the week, we see that the p-value is 7.6361763^{-8} so at least one of the day of the week means is not equal to the others.
for_location_type <- rats_raw %>%
drop_na(location_type) %>%
filter(location_type != "Other (Explain Below)") %>%
group_by(location_type) %>%
mutate(count_loc = n()) %>%
ungroup() %>%
filter(location_type %in% c("3+ Family Apt. Building", "1-2 Family Dwelling", "3+ Family Mixed Use Building", "Commercial Building", "Vacant Lot", "Construction Site"))
ggplot(data = for_location_type, aes(x = fct_infreq(location_type))) +
geom_bar() +
theme_minimal() +
coord_flip() +
labs(title = "Top 6 Location Types for Sightings",
x = "Location Type",
y = "Count")

rat_location <-
rats_raw %>%
filter(location_type %in% c("3+ Family Apt. Building", "1-2 Family Dwelling", "3+ Family Mixed Use Building", "Commercial Building", "Vacant Lot", "Construction Site")) %>%
group_by(location_type, sighting_month, sighting_year) %>%
mutate(rat_per_month = n()) %>%
slice(1) %>%
select(location_type, sighting_month, sighting_year, rat_per_month)
test_location <- broom::tidy(oneway.test(rat_per_month ~ location_type, data = rat_location))
The above shows the top 6 location types for rat sightings. 3+ Family Apt. Buildings report the highest amount of rat sightings among all location types, while 1-2 Family Dwellings and 3+ Family Mixed Use Buildings report the next two highest amount of sightings. These location types are followed by commercial buildings, vacant lots, and construction sites. Running an ANOVA for monthly rat sightings against the top 6 location types, there is a p-value of 5.9127557^{-157} so at least one of the location type means is not equal to the others.
In order to display rat sightings across New York City, we opted to create an interactive heat map, showing the hot spots for rat sightings, and a time series plot.
Focusing in on Columbia University’s Presbyterian Hospital campus at 168th Street, there is a lack of hot spots, but the immediate surrounding areas appear to have fairly normal sightings.
## Overall Sightings Map and Heat Map
top = 40.917577 # north lat
left = -74.259090 # west long
right = -73.700272 # east long
bottom = 40.477399 # south lat
nyc = rats_raw %>%
filter(latitude >= bottom) %>%
filter ( latitude <= top) %>%
filter( longitude >= left ) %>%
filter(longitude <= right)
center_lon = median(nyc$longitude,na.rm = TRUE)
center_lat = median(nyc$latitude,na.rm = TRUE)
factpal = colorFactor("blue", nyc$n)
nyc %>%
leaflet() %>%
addProviderTiles("Esri.NatGeoWorldMap") %>%
addHeatmap(lng = ~longitude, lat = ~latitude, intensity = ~(nyc$n), blur = 20, max = 0.05, radius = 15) %>%
setView(lng=center_lon, lat=center_lat,zoom = 10)
Over the 13 years of data, rat sightings have increased and display a cyclical pattern each year. Rat sightings drop in the colder months and rise in the warmer months each year. Rat sightings in 2020-2023 appear to have increased
overall <- rats_raw %>%
group_by(sighting_year, sighting_month_num, sighting_day) %>%
summarize(count = n()) %>%
mutate(date = as.Date(paste(sighting_year, sighting_month_num, sighting_day, sep = "-")))
time_series = xts(overall$count , order.by= overall$date)
hchart(time_series, name = "Rat Sightings") %>%
hc_add_theme(hc_theme_darkunica()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of New York", style = list(fontSize = "12px")) %>%
hc_title(text = "Time Series of NYC Rat Sightings") %>%
hc_legend(enabled = TRUE)
Insert
Insert
Our project aimed to investigate the factors that contribute to rat sightings in New York City. We were interested in how rat sightings vary by different measures of time, borough, location type and what variables are most important.
To investigate the impact of these variables, we created a series of exploratory plots that displaying the relationship between rat sightings and year, month, day of the week, borough, and location type. We followed each up with an ANOVA test to test the relationship between monthly rat sightings and each group; all were statistically significant. We learned that rat sightings have increased year-to-year, with large increases since 2020. This is likely due to more rats scavenging outside due to pandemic-induced restaurant closures and summers becoming significantly warmer and wetter than summers in prior years. We also learned that Brooklyn has the most rat sightings of all boroughs and apartment buildings with 3 or more units have the most rat sightings of all location types.
In addition to our data visualizations, we conducted linear models
predicting latitude and longitude & rat sightings per month. We
found 4 predictors that are helpful in predicting monthly rat sightings
(location_type, sighting_year,
borough, and season), with coefficients in the
expected directions according to our exploratory plots.
Through this analysis, we now understand some of the factors that influence the amount and location of rat sightings in New York City. It became clear that there’s no avoiding our furry neighbors.